Estimation and correction for GC-content bias in high throughput sequencing
Abstract
GC-content bias describes the dependence between fragment count (read coverage) and GC content found in high-throughput sequencing assays, particularly the Illumina Genome Analyzer technology. This bias can dominate the signal of interest for analyses that focus on measuring fragment abundance within a genome, such as copy number estimation. The bias is not consistent between samples, and current methods to remove it in a single sample do not assume any knowledge of the curve shape or scale. In this work we analyze regularities in the GC-bias patterns and find a compact description for this curve family. It is the GC content of the full DNA fragment, not only the sequenced read, that most influences fragment count. This GC effect is unimodal: both GC-rich and AT-rich fragments are under-represented in the sequencing results. Based on these findings, we propose a new method to calculate predicted coverage and correct for the bias. This parsimonious model produces single-bp predictions, which suffice to predict the GC effect on fragment coverage at all scales, on all chromosomes, and for both strands; this allows optimal GC-effect correction regardless of the downstream smoothing or binning. We demonstrate our model’s potential for improving on current approaches to copy-number estimation. These GC-modeling considerations can also inform other high-throughput sequencing analyses such as ChIP-seq and RNA-seq. Finally, our analysis provides empirical evidence strengthening the hypothesis that PCR is the most important cause of the GC bias.
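As a rough illustration of what a single-bp prediction based on fragment GC content could look like, the sketch below stratifies candidate fragment start positions by the GC count of the fragment beginning there and estimates one fragment rate per stratum. It is not the authors' implementation: the function names, the single fixed fragment length `frag_len`, and the treatment of every position as mappable on one strand are all assumptions made for illustration.

```python
from collections import Counter


def fragment_gc_counts(genome, frag_len):
    """GC count of the frag_len-long window starting at each position.

    Assumption: every fragment has the same length frag_len.
    """
    is_gc = [1 if base in "GCgc" else 0 for base in genome]
    n_windows = len(genome) - frag_len + 1
    if n_windows <= 0:
        return []
    counts = [sum(is_gc[:frag_len])]
    # Slide the window one base at a time: drop the base that leaves,
    # add the base that enters.
    for start in range(1, n_windows):
        counts.append(counts[-1] - is_gc[start - 1] + is_gc[start + frag_len - 1])
    return counts


def estimate_gc_rates(gc_counts, fragment_starts):
    """Observed fragments per candidate position, stratified by GC count.

    fragment_starts holds the 0-based start position of each observed
    fragment (hypothetical input format).
    """
    positions_per_gc = Counter(gc_counts)
    fragments_per_gc = Counter(
        gc_counts[s] for s in fragment_starts if 0 <= s < len(gc_counts)
    )
    return {gc: fragments_per_gc[gc] / n for gc, n in positions_per_gc.items()}


def predicted_coverage(gc_counts, rates):
    """Expected number of fragments starting at each position."""
    return [rates[gc] for gc in gc_counts]


if __name__ == "__main__":
    genome = "ATGCGCATAT"           # toy sequence, not real data
    starts = [0, 4, 4, 1, 0]        # toy fragment start positions
    gc = fragment_gc_counts(genome, frag_len=4)
    rates = estimate_gc_rates(gc, starts)
    print(predicted_coverage(gc, rates))
```

Because the prediction is made per base pair, it can be summed over any bin or smoothing window afterwards, and an observed count can then be divided by the summed prediction to obtain a GC-adjusted value at that scale.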
Similar articles
Summarizing and correcting the GC content bias in high-throughput sequencing
GC content bias describes the dependence between fragment count (read coverage) and GC content found in Illumina sequencing data. This bias can dominate the signal of interest for analyses that focus on measuring fragment abundance within a genome, such as copy number estimation (DNA-seq). The bias is not consistent between samples, and there is no consensus as to the best methods to remove it ...
Systematic bias in high-throughput sequencing data and its correction by BEADS
Genomic sequences obtained through high-throughput sequencing are not uniformly distributed across the genome. For example, sequencing data of total genomic DNA show significant, yet unexpected enrichments on promoters and exons. This systematic bias is a particular problem for techniques such as chromatin immunoprecipitation, where the signal for a target factor is plotted across genomic featu...
Analytical Biases Associated with GC-Content in Molecular Evolution
Molecular evolution is being revolutionized by high-throughput sequencing allowing an increased amount of genome-wide data available for multiple species. While base composition summarized by GC-content is one of the first metrics measured in genomes, its genomic distribution is a frequently neglected feature in downstream analyses based on DNA sequence comparisons. Here, we show how base compo...
The impact of RNA secondary structure on read start locations on the Illumina sequencing platform
High-throughput sequencing is subject to sequence-dependent bias, which must be accounted for if researchers are to make precise measurements and draw accurate conclusions from their data. A widely studied source of bias in sequencing is the GC content bias, in which levels of GC content in a genomic region affect the number of reads produced during sequencing. Although some research has been p...
Sensitivity of Noninvasive Prenatal Detection of Fetal Aneuploidy from Maternal Plasma Using Shotgun Sequencing Is Limited Only by Counting Statistics
We recently demonstrated noninvasive detection of fetal aneuploidy by shotgun sequencing cell-free DNA in maternal plasma using a next-generation high-throughput sequencer. However, GC bias introduced by the sequencer placed a practical limit on the sensitivity of aneuploidy detection. In this study, we describe a method to computationally remove GC bias in short-read sequencing data by applying ...